Time Series Analysis of Spotify Data
Introduction
Companies are always looking for methods to analyze the popularity of their products, as well as what predicts that popularity over time. The music industry is no different: Spotify has developed a popularity index that ranks songs from 0 to 100 on how popular they are compared to other songs. The score has three components: total streams of a song, how recently a song has been played, and the frequency at which a song is played. This score is immensely useful to Spotify and to music industry professionals on a song-by-song basis, because having a high popularity score increases a song’s discoverability to listeners. However, modeling mean values of song popularity over time also has the potential to be useful for Spotify (or similar streaming platforms), because it could help companies understand the factors that play into users’ satisfaction with the platform and listening patterns over time, as well as predict future values of popularity that might inform business decisions. We have obtained a Kaggle dataset containing information about audio features of more than 160,000 Spotify songs that were released between 1921 and 2020. While several previous analyses have looked at this data and even examined the relationships between features and popularity, these projects have ignored the implicit potential for temporal autocorrelation in the data. Because the total number of listens for a given song inevitably increases over time, and because frequency and recency of listens depend in large part on media attention and artist activity, the mean song popularity for songs released in one month is likely to be highly related to that of songs released the previous month. Our goal for this project is to create a model of the mean popularity for songs released at given intervals over time using song feature data as predictors and ARIMA errors to account for leftover temporal autocorrelation.
Data Preparation and Exploratory Data Analysis
The original dataset for this project contains nineteen variables, including song name, artist, release date, and audio features such as acousticness, danceability, loudness, and energy. A full codebook of all variables and their definitions is located in the data folder of this repository (spotify_data_codebook.md). The original data contained audio features and identification information for 169,909 songs.
Data Preparation
First, we altered the structure of our data, creating a new dataset (spotify_mean_yearly) that contains each year from 1921 to 2020 and the mean values of the following feature variables: acousticness, danceability, duration, energy, explicit, instrumentalness, key, liveness, loudness, mode, popularity, speechiness, tempo, and valence. We also created a dataset at monthly granularity to establish a regular time interval on which to model. Notably, filtering the release date for only observations that contain monthly data meant we excluded about 30% of observations (50,382 of the original 169,909), but we don't have a strong reason to believe that this aggregation will bias the data, given that the yearly mean popularity time series for songs with and without monthly data matched (see plot).
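The monthly aggregation step can be sketched as follows. This is a minimal illustration, not the exact pipeline we ran: the toy rows, the object name spotify_mean_monthly_sketch, and the use of string length to detect a full YYYY-MM-DD release date are all assumptions for demonstration.

```r
library(dplyr)

# toy rows standing in for the raw Kaggle data; column names are assumed
spotify <- tibble(
  release_date = c("1999-04-01", "1999-04-20", "1999-05-03", "2000"),
  popularity   = c(40, 60, 55, 70),
  danceability = c(0.50, 0.70, 0.60, 0.80)
)

spotify_mean_monthly_sketch <- spotify %>%
  # keep only songs whose release date includes a month (drops the bare "2000")
  filter(nchar(release_date) >= 7) %>%
  # collapse each release date to the first of its month
  mutate(month = as.Date(paste0(substr(release_date, 1, 7), "-01"))) %>%
  group_by(month) %>%
  # mean of each feature per month, named like popularity_month_mean
  summarise(across(c(popularity, danceability), ~ mean(.x),
                   .names = "{.col}_month_mean"),
            .groups = "drop")
```

Here the two April 1999 rows collapse to a single monthly row with popularity_month_mean of 50, and the row with only a year of release information is dropped, mirroring the ~30% exclusion described above.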
EDA: Trends over time
In order to get a general sense of the trends in this data, we plotted the mean yearly values of each of the song feature variables.
Of note, mean song popularity increased steeply over time. This probably has to do with the rise of the internet and free music streaming over the last couple of decades. However, mean song popularity decreased sharply in the last couple of years, perhaps because not as many people have had the chance to listen to the newest songs yet. Mean loudness and energy have generally increased over time, while mean song acousticness has dropped. Mean song liveness and instrumentalness had moderate downward trends. Mean song duration rose over time but has dropped again in the last couple of decades. Song danceability fell from the 1920s to the 1950s but has been on an upward trajectory ever since.
Next, we plotted the mean monthly values of each of the song feature variables to examine these trends at a more granular level.
After plotting at a monthly timescale, we can see more nuance in these trends. For instance, the downward trend we observed in the yearly time series for liveness appeared to be driven by a couple of months early in the century with particularly high liveness that biased the averages. Similarly, for instrumentalness, there was much more variance in mean monthly values early in the century, creating the illusion of a crisper downward trend. Because these monthly time series appear to give us far more information than the yearly series, at this point we chose to model monthly mean popularity from monthly mean predictor values rather than using a yearly temporal granularity. Additionally, the ability to predict monthly mean popularity would likely be more useful for a company than yearly predictions in terms of making timely and specific business decisions.
Bivariate EDA: Relationships between popularity and song features
To continue our EDA and check which of these covariates might have a relationship with our outcome variable, song popularity, we plotted each variable against the response at monthly granularity. We also fit a rudimentary linear model (black) and loess curve (green) to each.
The bivariate EDA (grouped rowwise by behavior) revealed some trends between the predictors and mean monthly song popularity. Danceability, loudness, and energy all had relatively strong positive trends with mean song popularity. Instrumentalness, acousticness, and liveness all had negative trends with song popularity (higher values of the predictor corresponding to lower song popularity, and vice versa). Speechiness, duration, and tempo all had values gathered in one area across various popularity levels, and explicitness had a moderate positive trend with mean song popularity. The loess curves reveal some strange behavior at the very high or very low values of these predictors, likely because such extreme popularity values are uncommon.
spotify_decades <- spotify_mean_monthly %>%
  mutate(year = as.numeric(format(month, '%Y'))) %>%
  mutate(decade = year - year %% 10)
spotify_decades %>%
  ggplot(aes(x = danceability_month_mean, y = popularity_month_mean)) +
  geom_point() +
  geom_smooth(color = 'blue', alpha = 0.75, se = FALSE, method = "lm") +
  geom_smooth(color = "red", alpha = 0.25, se = FALSE) +
  facet_wrap(~decade)
spotify_decades %>%
ggplot(aes(x=loudness_month_mean, y = popularity_month_mean)) +
geom_point() +
geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") +
geom_smooth(color = "red", alpha = 0.75, se = FALSE) +
facet_wrap(~decade)
spotify_decades %>%
ggplot(aes(x=energy_month_mean, y = popularity_month_mean)) +
geom_point() +
geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") +
geom_smooth(color = "red", alpha = 0.75, se = FALSE) +
facet_wrap(~decade)
spotify_decades %>%
ggplot(aes(x=instrumentalness_month_mean, y = popularity_month_mean)) +
geom_point() +
geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") +
geom_smooth(color = "red", alpha = 0.75, se = FALSE) +
facet_wrap(~decade)
spotify_decades %>%
ggplot(aes(x=acousticness_month_mean, y = popularity_month_mean)) +
geom_point() +
geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") +
geom_smooth(color = "red", alpha = 0.75, se = FALSE) +
facet_wrap(~decade)
Checking cross correlation
#lag correlations of tempo on popularity
spotify_mean_monthly %>%
select(month, tempo_month_mean, popularity_month_mean) %>%
mutate(
"lag 0" = lag(tempo_month_mean,0),
"lag 1" = lag(tempo_month_mean,1),
"lag 2" = lag(tempo_month_mean,2),
"lag 3" = lag(tempo_month_mean,3),
"lag 4" = lag(tempo_month_mean,4),
"lag 5" = lag(tempo_month_mean,5),
"lag 6" = lag(tempo_month_mean,6),
"lag 7" = lag(tempo_month_mean,7),
"lag 8" = lag(tempo_month_mean,8),
"lag 9" = lag(tempo_month_mean,9),
"lag 11"= lag(tempo_month_mean,11),
"lag 10"= lag(tempo_month_mean,10)) %>%
select(-tempo_month_mean, -month) %>%
tidyr::gather(lag, tempo_month_mean, -popularity_month_mean) %>%
group_by(lag) %>%
summarize(corr = cor(tempo_month_mean, popularity_month_mean, use="complete.obs"))
# A tibble: 12 × 2
lag corr
<chr> <dbl>
1 lag 0 0.496
2 lag 1 0.499
3 lag 10 0.501
4 lag 11 0.493
5 lag 2 0.494
6 lag 3 0.495
7 lag 4 0.500
8 lag 5 0.501
9 lag 6 0.493
10 lag 7 0.498
11 lag 8 0.495
12 lag 9 0.498
#plot lag CCF
forecast::ggCcf(spotify_mean_monthly$tempo_month_mean, spotify_mean_monthly$popularity_month_mean)
Fitting preliminary ARIMA model to Popularity scores
As part of our EDA, we used a step-by-step process to create an ARIMA model of mean monthly popularity scores from 1921 to 2020. Ultimately, we will use an ARIMA component to account for leftover temporal autocorrelation in the residuals of our linear model of mean song popularity based on mean song features for a given month; this initial ARIMA model may inform that component by showing us the temporal patterns in song popularity over time.
# convert data frame to time series
spotify_ts <- xts(spotify_mean_monthly$popularity_month_mean, spotify_mean_monthly$month)
spotify_ts <- as.ts(spotify_ts)
# forecast ARIMA
step1 <- forecast::ggtsdisplay(spotify_ts, points = FALSE)
Based on the time series visualization, we will begin by applying first-order differencing with an \(ARIMA(0,1,0)\) model. We will also store the root mean squared error value to track our progress while iterating through ARIMA models.
# first-order differencing
step2 <- forecast::Arima(
spotify_ts, order = c(0, 1, 0)
)
forecast::ggtsdisplay(step2$residuals, points = FALSE, lag.max = 36)
# store rmse value
step2_rmse <- yardstick::rmse_vec(
spotify_ts %>% unclass(),
step2$fitted %>% unclass()
)
The time series has somewhat improved, but some patterns remain that we can alleviate with an MA process. We will apply an MA(2) process on top of the first-order differencing with an \(ARIMA(0,1,2)\) model.
# first order differencing
# MA(2)
step3 <- forecast::Arima(
spotify_ts, order = c(0, 1, 2)
)
forecast::ggtsdisplay(step3$residuals, points = FALSE, lag.max = 36)
# store rmse value
step3_rmse <- yardstick::rmse_vec(
spotify_ts %>% unclass(),
step3$fitted %>% unclass()
)
The MA component somewhat improved the model, bringing the root mean squared error from 3.47 to 2.62. We can confirm our findings using the auto.arima function:
# auto arima on time series
forecast::auto.arima(spotify_ts)
Series: spotify_ts
ARIMA(0,1,2) with drift
Coefficients:
ma1 ma2 drift
-0.8890 0.0515 0.0659
s.e. 0.0315 0.0315 0.0130
sigma^2 = 6.753: log likelihood = -2527.33
AIC=5062.67 AICc=5062.7 BIC=5082.55
# model fit
model = "Final Model - forecast::Arima (0,1,2) "
rmse = (spotify_ts-step3$fitted)^2 %>% mean() %>% sqrt() %>% round(3) %>% paste0(" [RMSE: ", . ,"]")
step3 %>%
{tibble(
spotify_ts = spotify_ts %>% unclass(),
model = .$fitted %>% unclass(),
time = time(.$fitted) %>% unclass()
)} %>%
tidyr::gather(var, popularity_year_mean, -time) %>%
mutate(var = forcats::as_factor(var)) %>%
ggplot(aes(x=time, y=popularity_year_mean, color=var)) +
geom_line(alpha = 0.75, linewidth = 0.8) +
labs(title = paste(model, rmse), x = "Time", y = "Average Popularity Score",
color = "")
After applying first-order differencing and an MA(2) component, the model fit appears complete. The ACF and PACF show no remaining pattern after the final step, and the fitted model tracks the observed data well. The final model is \(ARIMA(0,1,2)\), with a root mean squared error of 2.62.
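In equation form, the selected \(ARIMA(0,1,2)\) model with drift says that each month's mean popularity equals the previous month's value plus a constant and a short moving average of past shocks; substituting the coefficients reported by auto.arima above:

```latex
y_t = y_{t-1} + c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2},
\qquad \hat{\theta}_1 = -0.889,\quad \hat{\theta}_2 = 0.052,\quad \hat{c} = 0.066
```

The large negative \(\hat{\theta}_1\) indicates that a surprise jump in one month's mean popularity is mostly offset in the next, so the differenced series behaves like noise around a small positive drift.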
# report ARIMA model
m = as_tsibble(spotify_ts) %>%
model(ARIMA(value ~ pdq(0, 1, 2)))
Model fitting
Our goal in creating this model is to predict monthly mean song popularity based on monthly mean song features. From our EDA, we can see that danceability, energy, explicitness, instrumentalness, loudness, and tempo had linear associations with monthly mean song popularity. We will start by fitting a multiple linear regression model with mean monthly popularity as the outcome, the features identified in EDA as predictors, and no temporal component.
# fit model
model1 <- lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean +
instrumentalness_month_mean + loudness_month_mean + tempo_month_mean, data = spotify_mean_monthly)
# output model
summary(model1)
Call:
lm(formula = popularity_month_mean ~ danceability_month_mean +
energy_month_mean + explicit_month_mean + instrumentalness_month_mean +
loudness_month_mean + tempo_month_mean, data = spotify_mean_monthly)
Residuals:
Min 1Q Median 3Q Max
-88.192 -4.858 0.811 5.789 24.988
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.92776 4.95018 -5.036 5.59e-07 ***
danceability_month_mean 5.66734 4.88026 1.161 0.245791
energy_month_mean 78.90610 3.54047 22.287 < 2e-16 ***
explicit_month_mean 38.47577 3.03301 12.686 < 2e-16 ***
instrumentalness_month_mean -24.56021 1.98068 -12.400 < 2e-16 ***
loudness_month_mean -0.62190 0.17437 -3.567 0.000378 ***
tempo_month_mean 0.06526 0.03887 1.679 0.093484 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.34 on 1059 degrees of freedom
Multiple R-squared: 0.7456, Adjusted R-squared: 0.7442
F-statistic: 517.3 on 6 and 1059 DF, p-value: < 2.2e-16
# store RMSE and AIC
model1_rmse <- yardstick::rmse_vec(
spotify_mean_monthly$popularity_month_mean,
model1$fitted
)
model1_aic <- AIC(model1)
The model performed decently, with a root mean squared error of 10.309 and an AIC of 8015.144. We want to include a temporal component, so we will move forward by adding a cross-correlation lag term on the tempo predictor to our existing model.
# fit model
model2 = lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean +
instrumentalness_month_mean + loudness_month_mean + tempo_month_mean +
lag(tempo_month_mean, 10),
data = spotify_mean_monthly)
# output model
summary(model2)
Call:
lm(formula = popularity_month_mean ~ danceability_month_mean +
energy_month_mean + explicit_month_mean + instrumentalness_month_mean +
loudness_month_mean + tempo_month_mean + lag(tempo_month_mean,
10), data = spotify_mean_monthly)
Residuals:
Min 1Q Median 3Q Max
-81.740 -4.630 0.888 5.828 22.769
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.08750 5.57833 -9.696 < 2e-16 ***
danceability_month_mean 4.93714 4.66043 1.059 0.28967
energy_month_mean 66.87425 3.56841 18.741 < 2e-16 ***
explicit_month_mean 44.24741 2.97637 14.866 < 2e-16 ***
instrumentalness_month_mean -22.81283 1.93954 -11.762 < 2e-16 ***
loudness_month_mean -0.48275 0.17034 -2.834 0.00468 **
tempo_month_mean 0.05469 0.03711 1.474 0.14079
lag(tempo_month_mean, 10) 0.32335 0.03186 10.149 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.75 on 1048 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.7718, Adjusted R-squared: 0.7702
F-statistic: 506.2 on 7 and 1048 DF, p-value: < 2.2e-16
# store RMSE and AIC
model2_rmse <- model2 %>%
broom::augment(newdata = spotify_mean_monthly) %>%
mutate(rmse = yardstick::rmse_vec(popularity_month_mean, .fitted)) %>%
select(rmse) %>%
first() %>%
pull()
model2_aic <- AIC(model2)
The new model performed better, with a root mean squared error of 9.713 and an AIC of 7816.361. The temporal component improved the model, so we will move forward by visualizing the residuals and considering a cross-correlation lag term on the response.
# visualize residuals
forecast::ggtsdisplay(model2$residuals, points=FALSE)
# visualize lag on residuals
forecast::ggCcf(spotify_mean_monthly$popularity_month_mean, model2$residuals)
# fit model
model3 = lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean +
instrumentalness_month_mean + loudness_month_mean + tempo_month_mean +
lag(tempo_month_mean, 10) + lag(popularity_month_mean, 10),
data = spotify_mean_monthly)
# output model
summary(model3)
Call:
lm(formula = popularity_month_mean ~ danceability_month_mean +
energy_month_mean + explicit_month_mean + instrumentalness_month_mean +
loudness_month_mean + tempo_month_mean + lag(tempo_month_mean,
10) + lag(popularity_month_mean, 10), data = spotify_mean_monthly)
Residuals:
Min 1Q Median 3Q Max
-16.3356 -1.8929 -0.1428 1.7722 18.1351
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.72306 2.10328 -0.344 0.7311
danceability_month_mean -0.33879 1.67639 -0.202 0.8399
energy_month_mean 2.49666 1.49398 1.671 0.0950 .
explicit_month_mean 4.90253 1.16780 4.198 2.92e-05 ***
instrumentalness_month_mean -3.61600 0.73364 -4.929 9.61e-07 ***
loudness_month_mean 0.07786 0.06159 1.264 0.2064
tempo_month_mean 0.00259 0.01335 0.194 0.8463
lag(tempo_month_mean, 10) 0.02911 0.01197 2.430 0.0152 *
lag(popularity_month_mean, 10) 0.92436 0.01100 84.047 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.505 on 1047 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.9705, Adjusted R-squared: 0.9703
F-statistic: 4311 on 8 and 1047 DF, p-value: < 2.2e-16
# store RMSE and AIC
model3_rmse <- model3 %>%
broom::augment(newdata = spotify_mean_monthly) %>%
mutate(rmse = yardstick::rmse_vec(popularity_month_mean, .fitted)) %>%
select(rmse) %>%
first() %>%
pull()
model3_aic <- AIC(model3)
The third model performed even better than the prior two, with a root mean squared error of 3.49 and an AIC of 5656.424. The temporal components improved the model, but instead of the cross-correlation terms added in the second and third models, we can account for the remaining autocorrelation with an ARIMA component on the errors.
# Create matrix of predictors
xreg <- cbind(energy_month_mean=spotify_mean_monthly$energy_month_mean,
explicit_month_mean=spotify_mean_monthly$explicit_month_mean,
danceability_month_mean=spotify_mean_monthly$danceability_month_mean,
instrumentalness_month_mean=spotify_mean_monthly$instrumentalness_month_mean,
loudness_month_mean=spotify_mean_monthly$loudness_month_mean,
tempo_month_mean=spotify_mean_monthly$tempo_month_mean
)
# cbind() produces no intercept column, so this drops the first predictor
# (energy_month_mean) from the regressor matrix
xreg <- xreg[,-1]
popularity <- ts(spotify_mean_monthly$popularity_month_mean)
# fit model
model4 <- auto.arima(popularity, xreg=xreg)
# output model
summary(model4)
Series: popularity
Regression with ARIMA(0,1,2) errors
Coefficients:
ma1 ma2 drift explicit_month_mean danceability_month_mean
-0.8911 0.0523 0.0646 0.6546 0.2525
s.e. 0.0314 0.0313 0.0128 0.9501 1.2202
instrumentalness_month_mean loudness_month_mean tempo_month_mean
-1.3288 0.0415 -0.0035
s.e. 0.5643 0.0362 0.0094
sigma^2 = 6.721: log likelihood = -2522.29
AIC=5062.58 AICc=5062.75 BIC=5107.32
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.001093683 2.581433 1.819034 NaN Inf 0.7843184 0.0006724013
# store AIC
model4_aic <- AIC(model4)
# visualize predicted versus actual
spotify_mean_monthly %>%
mutate(fitted = model4$fitted) %>%
ggplot(aes(x = month, y = popularity_month_mean)) +
geom_line() +
geom_line(aes(y = fitted), color = "pink") +
labs(title = "Predicted vs. Observed, Model 4")
The regression with ARIMA(0,1,2) errors performed the best of our models in terms of AIC, yielding an AIC of 5062.581.
Discussion
Discussion goes here…